Abstract

Background: This work describes a system for identifying event mentions in bio-molecular research abstracts that 
are either speculative (e.g. analysis of IkappaBalpha phosphorylation, where it is not specified whether 
phosphorylation did or did not occur) or negated (e.g. inhibition of IkappaBalpha phosphorylation, where 
phosphorylation did not occur). The data comes from a standard dataset created for the BioNLP 2009 Shared Task.
The system uses a machine-learning approach, where the features used for classification are a combination of 
shallow features derived from the words of the sentences and more complex features based on the semantic 
outputs produced by a deep parser. 

Method: To detect event modification, we use a Maximum Entropy learner with features extracted from the data 
relative to the trigger words of the events. The shallow features are bag-of-words features based on a small sliding 
context window of 3-4 tokens on either side of the trigger word. The deep parser features are derived from parses 
produced by the English Resource Grammar and the RASP parser. The outputs of these parsers are converted into 
the Minimal Recursion Semantics formalism, and from this, we extract features motivated by linguistics and the 
data itself. All of these features are combined to create training or test data for the machine learning algorithm.

Results: Over the test data, our methods produce approximately a 4% absolute increase in F-score for detection of 
event modification compared to a baseline based only on the shallow bag-of-words features. 

Conclusions: Our results indicate that grammar-based techniques can enhance the accuracy of methods for 
detecting event modification. 



Introduction 

This paper describes an automatic system for the recog- 
nition of bio-molecular events in biomedical literature. 
We base our research on the data from the BioNLP 
2009 Shared Task [1], where events are defined relative 
to trigger words of different types, and the goal is to 
both identify the trigger words, and infer the role that 
each trigger word plays in a given event. As an illustra- 
tion of the task, consider the input sentence: 

[protein TRADD]1 was the only protein that [trigger interacted]4 with
wild-type [protein TES2]2 and not with isoleucine-mutated [protein TES2]3.    (1)



where the words indicated in square brackets have 
been pre-identified as proteins as part of the Task speci- 
fication. The event structure for this sentence, as 
defined in the shared task gold standard annotations,
is exemplified below: 

Event evt1
TYPE = BINDING
TRIGGER = [trigger interacted]4    (2)
THEME1 = [protein TRADD]1
THEME2 = [protein TES2]2

Event evt2
TYPE = BINDING
TRIGGER = [trigger interacted]4    (3)
THEME1 = [protein TRADD]1
THEME2 = [protein TES2]3

Modification mod1
TYPE = Negation    (4)
THEME = evt2

The important points to note are: (a) events are 
defined as n-tuples via a unique trigger word and a
number of arguments, indexed to words in the original 
text; (b) coordination (in the object NP in (1)) poten- 
tially leads to multiple event tuples; and (c) negated 
events are represented as nested structures, in the form 
of the base (unnegated) event and a meta-operator scop- 
ing over that event. 
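
As an illustration of points (a) to (c), the event structures in (2) to (4) can be written down as simple records. This is only a sketch: the class and field names are our own for illustration and are not part of the shared task file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mention:
    """A pre-identified span, indexed as in example (1)."""
    index: int
    kind: str   # "protein" or "trigger"
    text: str


@dataclass
class Event:
    """An event tuple: one trigger word plus its theme arguments."""
    event_id: str
    event_type: str
    trigger: Mention
    themes: List[Mention]


@dataclass
class Modification:
    """A meta-operator (Negation or Speculation) scoping over a base event."""
    mod_id: str
    mod_type: str
    theme: Event


tradd = Mention(1, "protein", "TRADD")
tes2_wild_type = Mention(2, "protein", "TES2")
tes2_mutated = Mention(3, "protein", "TES2")
interacted = Mention(4, "trigger", "interacted")

# Coordination in the object NP yields two events sharing one trigger.
evt1 = Event("evt1", "BINDING", interacted, [tradd, tes2_wild_type])
evt2 = Event("evt2", "BINDING", interacted, [tradd, tes2_mutated])

# Negation is a nested structure pointing at the unnegated base event.
mod1 = Modification("mod1", "Negation", evt2)
```

Note that the negated event evt2 is stored in its unnegated form; the negation lives only in mod1.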

The shared task consists of three component tasks. The
first challenge (Task 1) identifies the trigger word for 
each event, together with its main arguments, e.g. (2) 
and (3) from above. Task 2 is devoted to the enrichment 
of events by identifying secondary arguments that 
further specify the event (e.g. LOCALISATION). Finally, 
Task 3 is focused on the identification of two types of 
event modification: Negation (such as (4)) and 
Speculation; Speculation modification indicates that 
the event was hedged or speculative (e.g. analysis of 
IkappaBalpha phosphorylation, where it is not clear 
whether IkappaBalpha phosphorylation occurred or 
not). Our primary interest in this paper is in Task 3, 
building on the outputs of Task 1. 

There has been increasing attention to the detection
of Negation and Speculation in scientific literature 
[2-4]. The importance of NEGATION, for example, can 
be illustrated via (1) above and a literature search task. 
Assume that the user were interested in identifying all 
biomedical papers which describe binding-style interac- 
tion between TRADD and isoleucine-mutated TES2. A 
classic text retrieval system would be able to identify 
that the abstract containing (1) refers to these two pro- 
tein types, without being able to capture the nature of 
the interaction between them. Paired with a Task 1-style
system, it would additionally be able to identify that this 
abstract specifically discusses binding-style interaction 
between the two proteins (and hence enhance retrieval 
precision). Only in combination with the predictions of 
a Task 3-style system, however, would it be able to addi- 
tionally predict that this sentence describes the absence 
rather than presence of interaction, and hence not use 
this sentence as the basis of retrieving this abstract 
(further improving precision). 



One of our principal interests is in the contribution of 
parsers to Task 3 performance. In essence, Task 3
involves determining the scope of Negation and
Speculation operators over the event predicates
predicted in Task 1. A parser which provides scoping
information as first-order outputs seems, intuitively, to be an
ideal solution to the task, and we seek to verify
empirically that this intuition fits with the actuality of the task.
As part of this exploration, we experiment with two par- 
sers of varying linguistic precision and coverage (the 
English Resource Grammar and RASP), which we com- 
pare with a standalone bag-of-words baseline, and var- 
ious hybrid techniques. 

The contributions of this paper are as follows: (1) we 
experiment with a range of parsers, individually and in 
combination; (2) we compare our Task 3 systems to a 
bag-of-words baseline, in addition to hybridising bag-of- 
words and parsing features; and (3) we combine our 
Task 3 classifiers with a range of Task 1 systems, and 
systematically investigate the interaction between Task 1 
and Task 3. In this, we achieve the best published
results to date for Negation over the Task 3 test set,
while for Speculation we achieve a score bettered by
only one system in the original shared task.

Related work 

For the purposes of this paper, we treat Task 1 (trigger 
word detection) as a black box, and base our event modi- 
fication classifiers on the output of a range of Task 1 sys- 
tems from the original BioNLP 2009 Shared Task, 
namely: the best-performing Task 1 system of UTurku 
[5], the second-best Task 1 system of JULIELab [6], and 
the mid-ranking Task 1 system of NICTA [7]. For the 
majority of our experiments, we use the output of 
UTurku exclusively. 

In the original shared task, only 6 systems participated 
in Task 3, of which 4 were based on hand-crafted rules 
operating over parser output, and developed based on 
the training data. The exceptions were the systems of [8] 
and [7]. The first system relied on decision trees trained 
over the BioScope corpus [4], which was specifically 
designed for the development of methods for detecting 
instances of event Speculation and Negation. The 
second system used a deep parser together with a 
machine learner, but did not combine parsers as we do 
and used a limited feature set. 

The best performance for event modification in the 
original shared task was obtained by ConcordU [9], with 
a hand-coded grammar built on top of a syntactic par- 
ser. For Speculation they relied on active cognition 
verbs to define their syntactic patterns, while hand- 
picked clue words provided the rules for Negation 
detection. 
The systems presented in [10-12] also relied on
hand-crafted rules built by analysing the training data. [10]
extended their ontology-driven pattern matching 
approach from Task 1, which suffered from low recall. 
[11] applied regular expressions to identify Negation 
trigger expressions, and defined rules based on deep 
parsing for both Negation and Speculation. [12] also
built hand-crafted rules, and observed that most of the 
errors came from their Task 1 output, which provided 
less than 30% of the events necessary for full recall in 
Task 3. 

One finding to come from the original shared task was 
the strong co-dependence between Task 1 and Task 3 
results, i.e. that it is hard to perform well at Task 3 
without a strong Task 1 system. We investigate this 
phenomenon relative to a selection of Task 1 systems in 
the 'Results and discussion' section. 

Outside the shared task, other work has looked at 
tasks involving subsets of the modification types in Task 
3. The Negex system [13] was developed as a general- 
purpose method for identifying phrasal Negation in 
medical texts, and relies on regular expressions over 
pre-identified trigger words, so is highly compatible with 
the BioNLP 2009 Shared Task data. The system was 
developed over clinical records, and for our experiments 
we used the version 2 implementation from
http://code.google.com/p/negex. More recently, the CoNLL 2010
Shared Task [14] was concerned with detection of spec- 
ulative language (or hedging), but not Negation, in bio- 
medical text. The first subtask required identification of 
speculative sentences, and the second subtask required 
identification of the hedging cues and determination of 
their scope, with a set of biomedical articles as the
primary focus. The second subtask in particular overlaps
somewhat with Task 3 of the BioNLP 2009 Shared 
Task, where the participants were probably implicitly 
identifying cues and their scope (i.e. whether trigger 
words fell within it) to an extent. The task participants 
had some success in applying syntactic analyses to the 
problem - for example, [15] used dependency parses 
and deep LFG parses. Cue words were identified using a 
machine learning approach with mostly shallow features, 
and then a set of hand-crafted rules based on the syn- 
tactic analyses was applied to the cue words to identify 
their scope. 

Methods 

Our basic approach is to parse the data, and construct 
feature vector inputs to a machine learner from the par- 
ser output(s). We build separate classifiers for each of 
the two subtasks of Speculation and Negation. In 
this section, we describe the parsers and feature extrac- 
tion methodology. 



Deep parsing with the ERG 

Intuitively, we would expect deep syntactico-semantic 
analysis to be useful in detecting both event Negation 
and Speculation, as knowledge of the relationships of
possibly distant elements (such as the Negation parti- 
cle not) to a particular target word can provide valuable 
information for classification. Indeed, as noted above, 
syntactic analysis of some kind was found to be useful 
for this task (e.g. [9]) and related tasks (e.g. [15]).

Further to this, it was our intention to evaluate the 
utility of deep parsing [16] for the task, rather than a 
shallower annotation such as the output of a depen- 
dency parser. With this in mind, we selected the English 
Resource Grammar (ERG: [17,18]), an open-source,
broad-coverage high-precision grammar of English in 
the HPSG framework [19]; the experiments reported in 
this paper are based on the '0902' version of the gram- 
mar. We combine the ERG with the PET parsing engine 
[20] in this work. 

While the ERG is relatively robust across different 
domains, it is a general-purpose resource, and there are 
some aspects of the language used in the biomedical 
abstracts that cause difficulties; unknown word handling 
is especially important given the nature of terms in the 
domain. Fortunately we can make some optimisations to 
mitigate this. The GENIA tagger [21] provides both 
POS and named entity annotations, which we used to 
constrain the input to the ERG in two ways, using the 
chart-mapping machinery of [22]:

♦ Biological named entities identified by the GENIA 
tagger are flagged as such, and the parser does not
attempt to decompose them. 

♦ POS tags are appended to each input token to con- 
strain the token to an appropriate category if it is 
absent from the ERG lexicon. 

In addition to producing parse trees and full Attri- 
bute-Value Matrices, the ERG can also produce output 
in particular semantic formalisms: Minimal Recursion 
Semantics (MRS: [23]) and the closely-related Robust 
Minimal Recursion Semantics (RMRS: [24]). For our 
feature generation here we make use of the latter, due 
to its compatibility with shallower parsers such as RASP. 

While the ERG has various grammar-internal mechanisms
for increasing coverage (e.g. allowing subject-verb
number mismatch), it does not have any facility to
construct parse fragments when no spanning
parse is found for a given input. This inevitably restricts 
the coverage of the grammar, and in the case of the 
BioNLP 2009 Shared Task data, the sentence-level cov- 
erage was found to be a respectable but still imperfect 
76%. Clearly, a fallback strategy is required for the 24% 
of sentences the ERG is unable to parse. Some methods
to achieve this are discussed in the next section. 

Extending parse coverage 

One obvious approach to augment the ERG and gain full
coverage over all inputs is to combine it with a simpler
bag-of-words approach, and this is indeed something we
investigate. However, in line with our intuition and
experimental evidence that syntactico-semantic features
are useful for the task, we also investigated improving
the coverage by adding an alternative parser.

One obvious choice would be any of the dependency 
parses provided by the organisers of the shared task. 
These parses have some advantages - broad coverage over 
the data and being tuned to the biomedical domain being 
obvious ones. However, we wished to build on our previous
feature engineering work on deriving salient indicators
from RMRSs, and hence opted to use RASP [25], a
broad-coverage general-purpose statistical parser, applying
the method of [26] to generate RMRSs.

In our setup, we did not allow fragment analyses from 
RASP due to the difficulty of converting them to RMRS 
outputs and doubts about their reliability. Similar to our 
approach with the ERG, we only used the top-ranked
parse. 

Under these conditions, we found that RASP was able 
to achieve similar coverage to the ERG, obtaining a parse 
for 76% of the sentences in the development data. How- 
ever, by taking the union of the sets of parseable inputs 
from the two parsers, we increase our sentence-level cov- 
erage to 93% of the development set, making features 
derived from RMRSs a far more realistic prospect as a 
means of detecting event modification. 
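
The coverage arithmetic above can be sketched as follows. The sentence identifiers are synthetic, chosen only to mirror the reported percentages (76% per parser, 93% for the union); they are not the actual development data.

```python
def coverage(parsed_ids, all_ids):
    """Fraction of sentences for which at least one parse was obtained."""
    return len(set(parsed_ids) & set(all_ids)) / len(all_ids)


# Synthetic stand-in for 100 development sentences.
all_ids = set(range(100))
erg_parsed = set(range(0, 76))     # 76 sentences parsed by the ERG
rasp_parsed = set(range(17, 93))   # 76 sentences parsed by RASP

erg_coverage = coverage(erg_parsed, all_ids)
rasp_coverage = coverage(rasp_parsed, all_ids)
union_coverage = coverage(erg_parsed | rasp_parsed, all_ids)

# Sentences that only RASP can parse are what the union gains over the ERG.
rasp_only = rasp_parsed - erg_parsed
```

With these toy sets, the union gains exactly the sentences in rasp_only over the standalone ERG.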

Feature extraction from RMRSs 

Figure 1 shows an RMRS obtained from one of the 
training documents. While there is insufficient space to 
give a complete treatment here, we highlight several 
aspects of importance to this paper. The primary
component of an RMRS is a bag of elementary predicates, or
EPs. Each EP has:

1. A label, such as l104. The label indices are unique
but arbitrarily assigned by the grammar. As such,
they do not necessarily start at zero, and generally
have increments of greater than one.

2. A predicate name, such as _differentiation_n_1.
The n before the final digit indicates the word is a
noun; a v denotes a verb, an a denotes an adjective
or adverb, and a q denotes a quantifier, such as a
determiner.

3. Character indices to the source sentence, such as
(130:146), indicating the predicate corresponds to
characters 130 to 146 in the source text.



4. A set of arguments. 

Arguments can be variables such as e30 or x23 (where
the first letter indicates the nature of the variable: e
referring to events, x referring to entities and u
indicating underspecified), or handles such as h33. The first
argument is always ARG0 and is afforded special status,
generally referring to the variable introduced by the
predicate. Subsequent arguments are labelled according to
the relation of the argument to the predicate. For
open-class predicates such as verbs, these are non-committal
names of the form ARGn, but follow certain conventions:
for example, in English, the (deep) subject of the
verb is generally ARG1 and the (deep) object is ARG2.
Some closed-class words such as determiners and
conjunctions follow different conventions for argument
naming; this is visible for the udef_q_rel quantifiers in
Figure 1. These handles are generally used in the qeq
constraints, which relate a handle to a label, indicating a
particular kind of outscoping relationship between the
handle and the label: either that the handle and label
are equal or that the handle is equal to the label apart
from one or more quantifiers occurring between the two
(the name is derived from 'equality modulo quantifiers').
Finally, there are in-g constraints which indicate that
labels can be treated as equal. For our purposes this
simply affects which qeq constraints they participate in:
for example, from the in-g constraint l28 in-g l104 and
the qeq constraint h27 qeq l28, we can also infer that
h27 qeq l104.
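
The naming conventions described above can be made concrete with a small sketch. It assumes, based on the predicates visible in Figure 1, that lexical predicates begin with an underscore while grammar-introduced predicates (such as neg_rel) do not; the function names are hypothetical.

```python
import re

# POS codes named in the text; any other code is returned unchanged.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective/adverb", "q": "quantifier"}
VARIABLE_KINDS = {"e": "event", "x": "entity", "u": "underspecified", "h": "handle"}

# _<lemma>_<pos>[_<sense>], e.g. _require_v_1 or _for_p
_LEXICAL_PRED = re.compile(r"_(?P<lemma>[^_]+)_(?P<pos>[a-z])(?:_(?P<sense>[^_]+))?")


def parse_predicate(name):
    """Return (lemma, pos, sense) for a lexical predicate, or None for
    grammar-introduced predicates such as neg_rel or udef_q_rel."""
    match = _LEXICAL_PRED.fullmatch(name)
    if match is None:
        return None
    return match.group("lemma"), match.group("pos"), match.group("sense")


def variable_kind(variable):
    """Classify a variable or handle by its first letter."""
    return VARIABLE_KINDS.get(variable[0], "unknown")
```

For instance, parse_predicate("_require_v_1") yields the lemma require with POS code v, while variable_kind("h33") identifies a handle.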

In constructing features, we make use of: 

♦ The outscopes relationship (specifically qeq-outscopes):
if EP A has a handle argument which qeq-outscopes
the label of EP B, A is said to immediately
outscope B; outscopes is the transitive closure of this.
For example, in Figure 1, the EP l3:_thus_a_1 has a
handle argument h4 as its ARG1, which in combination
with the qeq constraint h4 qeq l17 means that l3
immediately outscopes the EP l17:neg_rel. Similarly
l17 in turn immediately outscopes l20:_require_v_1.
From the transitive closure, we can use both of these
to infer that l3 also outscopes l20, since l3 outscopes
something (in this case l17) which in turn outscopes
l20.

♦ The shared-argument relationship, where EPs C
and D refer to the same variable in one or more of
their argument positions. For instance in Figure 1,
l28:compound_rel shares its ARG1 with the ARG0
of l104:_differentiation_n_of, as both slots are filled
by the same variable x23. We also in some cases
make further restrictions on the types of arguments
(ARG0, RSTR, etc.) that may be shared on either end
of the relationship.
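
The two relationships above can be sketched over a subset of the EPs in Figure 1. The data layout and function names are our own for illustration, and treating in-g constraints as equivalence classes of labels is a simplifying assumption, not necessarily how the original system handled them.

```python
from collections import defaultdict

# Subset of Figure 1: label -> (predicate, ARG0, remaining arguments).
EPS = {
    "l3": ("_thus_a_1", "e5", {"ARG1": "h4"}),
    "l17": ("neg_rel", "e19", {"ARG1": "h18"}),
    "l20": ("_require_v_1", "e2", {"ARG1": "u21", "ARG2": "x9"}),
    "l25": ("udef_q_rel", "x23", {"RSTR": "h27", "BODY": "h26"}),
    "l28": ("compound_rel", "e30", {"ARG1": "x23", "ARG2": "x29"}),
    "l104": ("_differentiation_n_of", "x23", {"ARG1": "u35"}),
}
QEQ = {"h4": "l17", "h18": "l20", "h27": "l28"}
IN_G = [("l20", "l102"), ("l20", "l103"), ("l28", "l104")]


def in_group_classes(pairs):
    """Group labels linked by in-g constraints into equivalence classes."""
    classes = {}
    for a, b in pairs:
        merged = classes.get(a, {a}) | classes.get(b, {b})
        for label in merged:
            classes[label] = merged
    return classes


def immediate_outscopes(eps, qeq, in_g):
    """A immediately outscopes B if a handle argument of A is qeq B's label
    (or a label that an in-g constraint treats as equal to it)."""
    classes = in_group_classes(in_g)
    result = defaultdict(set)
    for label, (_pred, _arg0, args) in eps.items():
        for value in args.values():
            if value in qeq:
                target = qeq[value]
                result[label] |= classes.get(target, {target})
    return result


def outscopes(immediate):
    """Transitive closure of the immediate-outscopes relation."""
    closure = {}
    for start in immediate:
        seen, stack = set(), list(immediate[start])
        while stack:
            label = stack.pop()
            if label not in seen:
                seen.add(label)
                stack.extend(immediate.get(label, ()))
        closure[start] = seen
    return closure


def shared_arguments(eps, a, b):
    """Variables (not handles) filling an argument slot of both EPs."""
    def variables(label):
        _pred, arg0, args = eps[label]
        return {arg0} | {v for v in args.values() if not v.startswith("h")}
    return variables(a) & variables(b)


immediate = immediate_outscopes(EPS, QEQ, IN_G)
closure = outscopes(immediate)
```

Here closure["l3"] contains both l17 and l20, matching the worked example in the text, and shared_arguments(EPS, "l28", "l104") recovers the shared variable x23.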
l1,
{ l3:_thus_a_1(62:67)(e5, ARG1: h4),
  l16:nf-kappa b_nn(68:78)(x11),
  l6:udef_q_rel(68:89)(x9, RSTR: h8, BODY: h7),
  l10:compound_rel(68:89)(e12, ARG1: x9, ARG2: x11),
  l13:udef_q_rel(68:89)(x11, RSTR: h15, BODY: h14),
  l101:_activation_n_1(79:89)(x9),
  l17:neg_rel(94:97)(e19, ARG1: h18),
  l20:_require_v_1(98:106)(e2, ARG1: u21, ARG2: x9),
  l102:parg_d_rel(98:106)(e22, ARG1: e2, ARG2: x9),
  l103:_for_p(107:110)(e24, ARG1: e2, ARG2: x23),
  l34:neuroblastoma cell_nn(111:129)(x29),
  l25:udef_q_rel(111:146)(x23, RSTR: h27, BODY: h26),
  l28:compound_rel(111:146)(e30, ARG1: x23, ARG2: x29),
  l31:udef_q_rel(111:146)(x29, RSTR: h33, BODY: h32),
  l104:_differentiation_n_of(130:146)(x23, ARG1: u35) },
{ h4 qeq l17, h8 qeq l10, h15 qeq l16, h18 qeq l20,
  h27 qeq l28, h33 qeq l34 },
{ l10 in-g l101, l20 in-g l102, l20 in-g l103, l28 in-g l104 }

Figure 1 A sample RMRS. RMRS representation of the sentence Thus NF-kappa B activation is not required for neuroblastoma cell differentiation
showing, in order, elementary predicates (each consisting of a label, predicate name, character span and arguments), qeq-constraints, and in-g
constraints. The unlabelled first argument of each predicate is the mandatory ARG0 argument, which is closely linked to the predicate. The
'udef_q_rel' predicates are default quantifiers introduced to keep the RMRS well-formed, which do not have directly corresponding words in the
sentence.



Feature sets and classification 

Feature vectors for a given event are constructed on the 
basis of the trigger word for that event, which we 
assume has already been identified. We use the term 
trigger EPs to describe the EP(s) which correspond to 
that trigger word - i.e. those whose character span 
encompasses the trigger word. We have a potentially 
large set of related EPs (with the kinds of relationships 
described above), which we filter to create the various 
feature sets, as outlined below. 

The following features are used to identify Negation.
In each case, a general feature is set (e.g. NegOutscope2),
as well as a specific one for the matching predicate.

♦ NegOutscope2: an EP in the RMRS belongs to a 
set of nine semantically negative predicates (e.g. 
_unable_a or _never_a) determined by manual 
examination of a small subset of the development 
data, and that EP outscopes a trigger EP. 

♦ NegConjIndex: an EP in the RMRS belongs to a set
of three negative conjunctions (_nor_c, _not_c and
_but+not_c) identified from the grammar, and the
negated daughter(s) of that EP are the ARG0 of a
trigger EP.

♦ Arg0NegOutscope2-SA: an EP has an argument that
matches the ARG0 of a trigger EP, which is in turn
outscoped by the same set of negative EPs as for
NegOutscope2. This feature is designed to catch
trigger EPs which are nouns, where the dominating
predicate is semantically negated.

♦ TrigPredProps: the predicate name of each trigger 
EP, as well as its POS. 
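
As a sketch of how a feature such as NegOutscope2 might be computed from a precomputed outscopes relation, consider the following. The negative predicate set shown is only an illustrative subset (the full set of nine is not listed here, and including neg_rel is our assumption), and the feature-name format is hypothetical.

```python
# Illustrative subset only; the full set of nine predicates is not listed
# in the text, and the membership of neg_rel is an assumption.
NEGATIVE_PREDS = {"neg_rel", "_unable_a", "_never_a"}


def neg_outscope2(predicates, outscopes, trigger_labels):
    """Emit a general feature plus a predicate-specific one whenever a
    semantically negative EP outscopes a trigger EP.

    predicates:     label -> predicate name
    outscopes:      label -> set of labels it (transitively) outscopes
    trigger_labels: labels of the trigger EPs for the event
    """
    features = set()
    for label, scoped in outscopes.items():
        pred = predicates[label]
        if pred in NEGATIVE_PREDS and scoped & trigger_labels:
            features.add("NegOutscope2")
            features.add("NegOutscope2:" + pred)
    return features


# Fragment of Figure 1, with 'required' (l20) as the trigger.
predicates = {"l3": "_thus_a_1", "l17": "neg_rel", "l20": "_require_v_1"}
outscopes = {"l3": {"l17", "l20"}, "l17": {"l20"}}
features = neg_outscope2(predicates, outscopes, {"l20"})
```

In this fragment _thus_a_1 also outscopes the trigger, but only neg_rel is in the negative set, so only it contributes features.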

The following are the features to identify Speculation. 
Once again, in each case, both a general feature and a spe- 
cial predicate-based feature are set. 

♦ SpecVObj2+WN: one of a pre-identified seed set of
six speculative verbs is found (e.g. _test or
_investigate), where its ARG2 (i.e. object) is the ARG0 of a
trigger EP. We additionally include WordNet sisters of
the speculative verbs, and in the case that a WordNet
sister matches, add an additional feature for the seed
speculative verb (in the original set). The seed verbs
were identified by examining a subset of the
development data.

♦ ModalOutscope: a modal verb (e.g. should) out- 
scopes a trigger EP. 

♦ AnalysisSA: the ARG0 of the trigger EP is also an
argument of an EP with the predicate name
_analysis_n. Such constructions involving the word analysis
are relatively frequent in speculative events in the
data.

♦ ModAdj: any adjectival or adverbial EPs which
have an ARG1 (corresponding to the modified noun
or verb) which matches the ARG0 of the trigger EP
(i.e. which are modifiers).

Both of these feature sets are the same as the most 
successful 'N4' and 'S3' feature sets used in [7]. 
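
The object-based speculation feature can be sketched in the same style. Only the two seed verbs named above are included, the WordNet expansion is omitted, and the input EP (for a hypothetical clause such as "we investigated NF-kappa B activation") is invented for illustration.

```python
# Two of the six seed verbs; the remaining four are not named in the text.
SEED_VERBS = {"test", "investigate"}


def spec_vobj2(eps, trigger_arg0s):
    """Emit features when a seed speculative verb takes the trigger's ARG0
    as its ARG2 (i.e. its deep object).

    eps:           iterable of (predicate name, argument dict)
    trigger_arg0s: set of ARG0 variables of the trigger EPs
    """
    features = set()
    for pred, args in eps:
        lemma = pred.strip("_").split("_")[0]
        if lemma in SEED_VERBS and args.get("ARG2") in trigger_arg0s:
            features.add("SpecVObj2")
            features.add("SpecVObj2:" + lemma)
    return features


# Hypothetical EPs: the verb's object x9 is the ARG0 of the trigger noun.
example_eps = [
    ("_investigate_v_1", {"ARG0": "e2", "ARG1": "x4", "ARG2": "x9"}),
    ("_activation_n_1", {"ARG0": "x9"}),
]
spec_features = spec_vobj2(example_eps, {"x9"})
```

A WordNet-based variant would additionally map each matched sister verb back to its seed verb before emitting the specific feature.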

Combining RMRSs from different sources 

Given that we have multiple potential sources of RMRSs 
to create feature vectors, there are several possible ways 
to combine them. The first is a fallback method. We have 
more confidence in the ERG parses and their ability to 
produce RMRSs for a number of reasons: the ERG is a 
deeper grammar, in contrast to the deliberate shallow- 
ness of the RASP parser, so we would expect, where it 
can find a parse, that its analyses would contain more 
useful information; additionally RMRS is closer to a 
native format for the ERG, as it is constructed composi- 
tionally as part of the parsing process, rather than in a 
post-processing step, as is the case with RASP. On the 
basis of this, we use the ERG-derived RMRS where it is
available, and where it isn't, fall back to the RMRS 
derived from RASP. 

An alternative is to place equal confidence in both
sources of RMRSs. Each sentence will have zero, one, or
two RMRSs available. Where we have one RMRS, we
construct features from it as usual. Where there are two
RMRSs, we construct features from each, and take the
union to form a single feature vector. A variant on this
method is to produce the same merged output if there
are multiple input RMRSs, but also produce a version of
each feature that is tagged with the source of the RMRS.
The intuition here is that while there are some
commonalities between the RMRS outputs, each grammar may
have different strengths and weaknesses in terms of
producing RMRSs, so it may be useful for the machine
learning algorithm to have (indirect) knowledge
of which grammar produced the particular feature.
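
The three combination strategies can be sketched as set operations over per-parser feature sets. Representing a missing parse as None and the "ERG:"/"RASP:" tag format are assumptions made for this illustration.

```python
def combine_fallback(erg_feats, rasp_feats):
    """Use ERG-derived features where an ERG parse exists, else RASP's."""
    if erg_feats is not None:
        return set(erg_feats)
    return set(rasp_feats or ())


def combine_union(erg_feats, rasp_feats):
    """Place equal confidence in both sources: take the union."""
    return set(erg_feats or ()) | set(rasp_feats or ())


def combine_tagged(erg_feats, rasp_feats):
    """Union of features, plus a copy of each tagged with its source."""
    tagged = {"ERG:" + f for f in erg_feats or ()}
    tagged |= {"RASP:" + f for f in rasp_feats or ()}
    return combine_union(erg_feats, rasp_feats) | tagged


erg = {"NegOutscope2"}
rasp = {"NegOutscope2", "ModalOutscope"}

fallback = combine_fallback(erg, rasp)
union = combine_union(erg, rasp)
tagged = combine_tagged(erg, rasp)
```

When neither parser produces an RMRS, all three functions return an empty set, leaving the instance to any bag-of-words features or the class priors.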

Bag-of-words features 

To evaluate the performance boost obtained through parsing
relative to more naive methods, we also experimented
with feature sets based on a bag-of-words approach with a
sliding context window of tokens on either side of the
token corresponding to the trigger, as determined by the
tokenisation of the GENIA tagger, without crossing
sentence boundaries. We evaluated a range of combinations
of preceding and following context window sizes from 0
to 5, and optimised the window size for each of the
Speculation and Negation subtasks.
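
A minimal sketch of the context-window extraction follows. Whitespace tokenisation stands in for the GENIA tagger, and excluding the trigger token itself and lowercasing are our assumptions rather than documented choices.

```python
def window_features(tokens, trigger_index, before, after):
    """Bag of words from up to `before` tokens left and `after` tokens
    right of the trigger. `tokens` holds a single sentence, so the window
    can never cross a sentence boundary."""
    start = max(0, trigger_index - before)
    left = tokens[start:trigger_index]
    right = tokens[trigger_index + 1:trigger_index + 1 + after]
    return {token.lower() for token in left + right}


sentence = ("Thus NF-kappa B activation is not required "
            "for neuroblastoma cell differentiation")
tokens = sentence.split()
trigger_index = tokens.index("required")

# Three tokens to either side of the trigger.
bow = window_features(tokens, trigger_index, before=3, after=3)

# All combinations of preceding and following window sizes from 0 to 5.
window_sizes = [(b, a) for b in range(6) for a in range(6)]
```

For this sentence the three-token window already captures the negation particle not, which helps explain why such a simple baseline can be competitive.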

The bag-of-words context-window is robust and gives 
100% coverage, so it gives us a chance at classifying the 
sentences which are not parseable using either parser. It 
is also possible that even on sentences we can parse 
with the ERG and/or RASP, the event modifications it 
can detect are at least partially complementary to those 
that are detectable with the RMRS-derived features, sug- 
gesting a combined approach. 

Classifier implementation 

To produce training data to feed into a classifier, we 
parsed as many sentences as possible using the ERG and/ 
or RASP, and used the output RMRSs to create training 
data, relying on the features described above. The con- 
struction of features, however, presupposes annotations 
for the events and trigger words. For producing training 
data, we used the provided trigger annotations. For the 
test phase, we simply use the outputs of the various Task 
1 classifiers as a source of trigger annotations, selecting 
the combination with the best performance over the 
development set. We used a maximum entropy classifier,
by applying Zhang Le's Maxent Toolkit
(http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html).
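
For readers unfamiliar with maximum entropy classification, the following toy sketch may help. It is not the Maxent Toolkit and makes no claim about its interface or training procedure: it is a plain stochastic-gradient trainer for the binary case, where a maximum entropy model coincides with logistic regression over binary features. The training examples are invented.

```python
import math

BIAS = "__bias__"  # always-on feature, playing the role of the class prior


def predict_prob(weights, feats):
    """Probability that the event is modified, given its active features."""
    score = sum(weights.get(f, 0.0) for f in feats | {BIAS})
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(examples, epochs=200, rate=0.5):
    """examples: list of (feature set, label) pairs, with label 0 or 1."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict_prob(weights, feats)
            for f in feats | {BIAS}:
                weights[f] = weights.get(f, 0.0) + rate * error
    return weights


# Invented toy data: the feature "not" signals a negated event.
examples = [
    ({"not", "required"}, 1),
    ({"required"}, 0),
    ({"not", "binds"}, 1),
    ({"binds"}, 0),
]
weights = train_maxent(examples)
```

An instance with no active features is scored from the bias weight alone, which mirrors the behaviour of falling back on class priors when no parse is available.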

Results and discussion 

We test the impact of the two parsers - individually and in 
combination - on SPECULATION and NEGATION event 
modification over both the development data and over the 
test data provided for the shared task. The reason we use 
both datasets is that evaluation over the test data is possi- 
ble only via a web form, with the restriction that only one 
run can be evaluated in each 24 hour time period, to 
maintain the sanctity of the test data. We thus carried out 
extensive experiments over the development data to fine- 
tune our feature engineering by applying combinations of 
different feature sets, including many not reported here. 
We apply only a representative set of classifiers to the test 
data. To explore the impact of the Task 1 results (trigger 
word detection) on event modification, we make use of 
the following Task 1 systems for the development and/or 
test datasets. Note that gold-standard annotations are not 
available for the test dataset, meaning that any experi- 
ments requiring gold-standard data can only be performed 
over the development dataset. 
♦ The output of the UTurku system [5], which was
the best-performing Task 1 system in the original 
shared task [DEV and TEST] 

♦ The gold-standard annotations, to evaluate our 
methods in isolation of Task 1 classifier noise [DEV 
only] 

♦ The outputs of the JULIELab and NICTA Task 1 
classifiers, to explore the impact of Task 1 classifier 
performance on event modification [DEV only] 

Results over the development data 

We present first the baseline bag-of-words results, and 
then our parser-based systems. 
Bag-of-words baseline 

We first carried out a series of experiments with different
window sizes for the bag-of-words method over the
development data, to determine the optimal window
size for each of the NEGATION and SPECULATION subtasks.
Using gold-standard Task 1 data and optimising
over event modification F-score, we found that the optimal
window size for SPECULATION was three words to
either side of the event trigger word (signified as W[-3,+3]),
at an F-score of 48.3%. For Negation, the marginally
wider window size of four words to the left and three
words to the right (signified as W[-4,+3]) produced the
optimal F-score of 53.3% over the development data (once
again based on gold-standard Task 1 annotations).
Perhaps the most surprising thing about this relatively
uninformed baseline is how well it can perform. These
window size settings are used exclusively in the bag-of-words
experiments presented in the remainder of this
paper for the respective subtasks.
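
The window-size search can be sketched as a small grid search over the balanced F-score. The evaluate callback is hypothetical: it stands for training and scoring a bag-of-words classifier on the development data for a given (left, right) window, which is not reproduced here.

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F-score (harmonic mean of precision and recall)."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_window(evaluate, max_size=5):
    """Return the (left, right) window maximising evaluate((left, right)),
    over all sizes from 0 to max_size on each side."""
    candidates = [(left, right)
                  for left in range(max_size + 1)
                  for right in range(max_size + 1)]
    return max(candidates, key=evaluate)
```

In practice evaluate would return the development-set F-score for the subtask being tuned, so the search is run once for Speculation and once for Negation.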
RMRSs and parser combination 

In Table 1 we present the results over the development 
data using the UTurku classifier and gold-standard Task 1 
annotations. We additionally include results for the rule- 
based Negex system [13] described earlier in the paper, as 
a benchmark for the Negation subtask. Recall that both 
the ERG and RASP have imperfect coverage over the data, 
meaning that in cases where bag-of-words features are not 
employed, the feature vector will consist of all negative 
features, and the classifier will fall back on the class priors 
to classify the instance in question. 

Firstly, for the pure RMRS-based features, there are 
obvious differences between the methods of RMRS con- 
struction. The standalone ERG produces respectable 
performance in NEGATION and acceptable performance 
in Speculation in relation to the baselines. In line with 
our predictions, the standalone ERG produces superior 
performance to the standalone RASP. 

Table 1 Results over development set
Results over the development data using gold-standard Task 1 annotations and the UTurku Task 1 system ("fb" = fallback strategy, where we use the first source
if possible, otherwise the second; "cb" = use undifferentiated RMRSs from each source to create feature vectors).

In terms of strategies for combining the features from
different RMRSs, it seems that the fallback strategy (fb)
is most effective: creating an RMRS from the ERG
where possible, and otherwise from RASP, produces a
substantial performance boost over the standalone ERG
strategy, which is consistent across SPECULATION and
NEGATION, and both the gold-standard and UTurku outputs.
This is interesting as for only 17% of the sentences in 
the data was there a RASP parse and not an ERG parse. 
It seems that there is relatively good compatibility 
between the features produced from these different 
RMRSs, so that features learnt from RASP-derived 
RMRSs can be used for ERG-derived RMRS output and
vice versa. The strategy which combines every possible 
parse obtained from the ERG and RASP (cb) is generally 
less effective, with the one exception of Negation, 
where bag-of-words features are combined with the 
RMRS features. In fact, in the majority of cases, cb without
bag-of-words features is inferior to using the ERG as
a standalone parser. 

When we combine the bag-of-words features with the 
RMRS-derived features, the results always improve over 
the equivalent RMRS results without bag-of-words, with 
recall being the primary beneficiary. The cb strategy
appears to benefit most from the addition of the bag-of- 
words features. 

Comparing our results over the NEGATION subtask to 
NegEx, it is evident that all results incorporating the 
ERG and/or bag-of-words features outperform this 
benchmark rule-based system, which is highly 
encouraging. 

We were surprised by the effectiveness of the bag-of- 
words approach in comparison to our more informed 
techniques, particularly for NEGATION, where the simple 
bag-of-words baseline was superior to all other methods 
when combined with the UTurku Task 1 classifier. 
Nonetheless, the parsing techniques are clearly shown to 
have some utility (bearing in mind that 7% of sentences 
still cannot be parsed under this setup and thus will not 
be classified correctly from RMRS-derived features). 
However, there is possibly room for improvement in the 
remaining 93% of sentences which we can parse: our 
results in Table 1 are still well below 93% recall. 

We have not performed any analysis to verify whether 
the number of events per sentence differs between 
parseable and unparseable sentences. Longer sentences 
tend to be harder to parse, and may contain a larger 
number of events by virtue of their length, meaning 
that the true upper limit on recall may be lower than 93%. 
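
The caveat above can be made concrete with a small calculation. The sketch below assumes that an event is reachable by RMRS-derived features only if its sentence can be parsed, so the recall ceiling is the share of events that fall in parseable sentences. The 93%/7% split comes from the text; the events-per-sentence rates are hypothetical values chosen only to show the direction of the effect.

```python
def recall_ceiling(parseable_sentences: float,
                   unparseable_sentences: float,
                   events_per_parseable: float,
                   events_per_unparseable: float) -> float:
    """Upper bound on recall when only parseable sentences can be classified."""
    reachable = parseable_sentences * events_per_parseable
    unreachable = unparseable_sentences * events_per_unparseable
    return reachable / (reachable + unreachable)


# Same event density everywhere: the ceiling equals the sentence coverage.
uniform = recall_ceiling(93, 7, 1.0, 1.0)

# Hypothetical case: unparseable sentences carry twice as many events.
skewed = recall_ceiling(93, 7, 1.0, 2.0)
```

With equal event density the ceiling is 0.93; if unparseable sentences are denser in events, the ceiling drops below that figure.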

Results over the test data 

In the testing phase, we repurposed all of the develop- 
ment data as extra training data, and retrained using 
some of the promising combinations of RMRS sources 
and bag-of-words feature vectors. These results are pre- 
sented in Table 2. Note that we are not able to evaluate 
over gold-standard Task 1 data, as it has not been 
released for the test data. 



Table 2 Results over test set 
Results over the test data using the UTurku Task 1 system ("fb" = fallback 
strategy, where we use the first source if possible, otherwise the second; "cb" 
= use undifferentiated RMRSs from each source to create feature vectors). 't' 
denotes the feature set which performed best over the development set 
using gold Task 1 annotations. 



The results here are not always what we would expect 
on the basis of the development results. The bag-of-words 
baseline continues to be an impressive performer 
for NEGATION, achieving an F-score of 24.0% with the 
UTurku data compared with 29.0% over the development 
data. However, the combination of the aggregated 
RMRS approaches and the bag-of-words features 
outperformed the bag-of-words baseline. 

The SPECULATION results show noticeably different 
behaviour from the development data. The primary 
difference seems to be that the bag-of-words baseline (at least 
for the context window that we selected) is of little use in 
comparison to the RMRS features. Encouragingly, the best 
result was obtained with a pure parser-based approach 
(fb(ERG, RASP)), and bag-of-words on its own was the 
poorest performer, with an F-score around half that of the 
parser-based method. This effect is even visible when 
combining the bag-of-words with the RMRS output, 
which resulted in a substantial decrease in F-score. 
Examining further, we can see that the bag-of-words recall is 
particularly low over SPECULATION, so it seems that the 
local contextual cues for SPECULATION which were 
learned from the training and development data are simply 
not present in the accessible events in the test data, while 
the longer-distance syntactic dependencies are still clearly 
useful. 
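
To make "local contextual cues" concrete, the following is a minimal sketch of a bag-of-words feature extractor over a fixed window around the trigger word. The function name, the `bow=` feature prefix, the lowercasing and the example sentence are all illustrative assumptions; only the idea of a small window of tokens on either side of the trigger comes from the method description.

```python
from typing import List, Set


def window_bag_of_words(tokens: List[str], trigger_index: int,
                        width: int = 3) -> Set[str]:
    """Collect unordered word features within `width` tokens of the trigger."""
    start = max(0, trigger_index - width)
    left = tokens[start:trigger_index]
    right = tokens[trigger_index + 1:trigger_index + 1 + width]
    # The trigger itself is excluded; position within the window is discarded.
    return {f"bow={tok.lower()}" for tok in left + right}


# Invented example sentence, tokenised on whitespace.
tokens = ("These results suggest that IkappaBalpha "
          "phosphorylation may be required").split()
features = window_bag_of_words(tokens, tokens.index("phosphorylation"), width=3)
```

Here a cue such as "suggest" is captured only because it lies within three tokens of the trigger; a cue further away would be invisible to these features, although it could still be reached through a syntactic or semantic dependency.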

In terms of overall performance in comparison to the 
original submissions to the shared task described in [1], 
these results are respectable. If we had been required to 
choose only one run for each of SPECULATION and 
NEGATION, the features would have been selected on 
the basis of the development set figures with gold Task 
1 annotations (another option would be to use the best 
automatically created Task 1 annotations); these figures 
are marked with 't' in Table 2. For SPECULATION, we 
would have submitted the fb(ERG, RASP) system, giving 
an F-score of 14.71%, which is higher than the 
second-placed team but well behind the ConcordU score of 
25.27% (the best performer over the test set would have 
been closer to the ConcordU performance, but this is not 
a fair comparison to make, as it takes advantage of 
knowing scores over the test data). In the NEGATION 
subtask, using this technique would have selected the 
same parameters which gave the best test set 
performance, giving an F-score of 27.71%, higher than the 
top-ranked ConcordU score of 23.13%. Of course, in both 
cases these results rely on high-performing Task 1 
systems from third parties, which is important for Task 3 
results, as we discuss below. 

Interaction between Task 1 and Task 3 

There is a clear interaction between Tasks 1 and 3 in our 
pipeline architecture, in that if there is an error in the 
Task 1 output for an event where there is SPECULATION 
or NEGATION, we have no way of correcting that mistake 
in our Task 3 classifier. What is less clear is the statistical 
nature of this interaction. To investigate this question, we 
plotted Task 3 performance relative to the performance of 
each of the three base Task 1 systems (UTurku, JULIELab 
and NICTA), over the various combinations of features. 
The results for SPECULATION and NEGATION are 
presented in Figure 2 and Figure 3, respectively. It is apparent 
from the two graphs that the correspondence is roughly 
linear, meaning that the relative gain in Task 3 F-score is 
roughly equivalent for every 1% gain in absolute F-score 
Figure 2 Task 3 against Task 1 for SPECULATION. Task 3 F-score against Task 1 F-score for SPECULATION, over the different 
combinations of Task 1 and Task 3 systems on the development set. 
Figure 3 Task 3 against Task 1 for NEGATION. Task 3 F-score against Task 1 F-score for NEGATION, over the different combinations of 
Task 1 and Task 3 systems on the development set. 



for Task 1. In the case of both SPECULATION and 
NEGATION, the slope of the various curves is relatively 
consistent at around 0.5, suggesting that it is possible to 
achieve a 1% increase in Task 3 F-score by boosting the 
Task 1 F-score by 2%. Of course, each of the curves in 
these graphs is based on only four data points, and there 
is inevitable noise in the output, but a rough linear trend 
is clearly demonstrated. 
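
The slope reading can be reproduced with an ordinary least-squares fit. The sketch below is illustrative only: the four (Task 1, Task 3) points are synthetic values constructed to lie on a line of slope 0.5 and are not taken from Figure 2 or Figure 3, and the function name is hypothetical.

```python
from typing import Sequence


def least_squares_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the ordinary least-squares line through the points (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic F-scores (in %), one point per Task 1 system plus gold standard.
task1_f = [30.0, 40.0, 50.0, 100.0]
task3_f = [10.0, 15.0, 20.0, 45.0]

slope = least_squares_slope(task1_f, task3_f)

# Task 1 improvement (in absolute F-score points) needed for a one-point
# Task 3 gain, under the linear assumption.
task1_gain_needed = 1.0 / slope
```

With only four points per curve, such a fit is sensitive to any single point, which is why the trend should be treated as approximate.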

In the feature engineering stage, we primarily used the 
oracle data for Task 1 to maximise the amount of train- 
ing data available. We felt that if we were to use our 
Task 1 classifications for events and trigger words, the 
effectively lower number of training instances would 
only hurt performance. However, this possibly led to a 
bias towards features which were more useful for classi- 
fying events that were not successfully classified by the 
Task 1 system. The development set shows similar per- 
formance drops under these conditions in Table 1. 



Conclusions 

We have presented a method for detecting event 
SPECULATION and NEGATION in bio-molecular literature, 
based on the BioNLP 2009 Shared Task data. We take a 
pipeline approach, first detecting event trigger words and 
arguments (Task 1), then identifying occurrences of event 
modification based on this output (Task 3). Our method 
interprets modifier scope via the semantic output of the 
ERG and/or RASP, and presents this to a machine learner 
in the form of a linguistically rich feature vector, which is 
optionally combined with bag-of-words features. We 
demonstrated that our parser-based approach was superior 
to a bag-of-words model for SPECULATION, achieving the 
best published results over the SPECULATION subtask in 
the process. Surprisingly, for NEGATION, the simple 
bag-of-words approach was superior to all parser-based 
classifiers over the development data, but for the test data, 
the parsers achieved a higher F-score. 

List of abbreviations used 

MRS: Minimal Recursion Semantics, a semantic formalism; RMRS: Robust 
Minimal Recursion Semantics, a formalism closely related to MRS; EP: 
Elementary Predicate, a unit of meaning in an MRS or RMRS; ERG: The 
English Resource Grammar, a handcrafted precision grammar of English; 
RASP: Robust Accurate Statistical Parser, a general purpose parser for English. 